week 4: overfitting/mcmc

mcmc – brms

10 islands, in a loop. Island 2 is twice as big as island 1, island 3 is three times as big as island 1, etc.

  ## record current position
    current <- 5
  ## flip coin to generate proposal
    (coin <- sample( c(-1,1) , size=1 ))
[1] 1
    (proposal <- current + coin)
[1] 6
  ## move?
    (prob_move <- proposal/current)
[1] 1.2
    (current <- ifelse( runif(1) < prob_move , proposal , current ))
[1] 6

metropolis algorithm

num_weeks <- 1e5
positions <- rep(0,num_weeks)
current <- 10
for ( i in 1:num_weeks ) {
  ## record current position
    positions[i] <- current
  ## flip coin to generate proposal
    proposal <- current + sample( c(-1,1) , size=1 )
  ## now make sure he loops around the archipelago
    if ( proposal < 1 ) proposal <- 10
    if ( proposal > 10 ) proposal <- 1
  ## move?
    prob_move <- proposal/current
    current <- ifelse( runif(1) < prob_move , proposal , current )
}
library(dplyr)
library(ggplot2)

data.frame(weeks = 1:1e5, positions) %>% 
  filter(weeks <= 100) %>% 
  ggplot(aes(x = weeks, y = positions)) +
  geom_line()
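One sanity check on the sampler: in the long run, the proportion of weeks spent on each island should be proportional to its size, so island 10 gets visited roughly ten times as often as island 1. Tallying the positions from the loop above:

```r
## long-run visit frequencies, relative to island 1
visits <- table(positions)
round(as.numeric(visits / visits[1]), 1)  # roughly 1, 2, 3, ..., 10
```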

gibbs sampling

Gibbs sampling works by drawing samples from the conditional distribution of each parameter given the current values of all other parameters. The process iteratively updates one parameter at a time, eventually converging to the joint posterior distribution. Gibbs sampling is particularly effective for high-dimensional problems where direct sampling is difficult, and it’s computationally efficient because it only requires conditional distributions rather than the full joint distribution.
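As a toy illustration (not from the lecture): for a bivariate normal with correlation rho, each conditional is itself normal, so a two-parameter Gibbs sampler just alternates two `rnorm()` draws.

```r
## Gibbs sampler sketch for a bivariate standard normal with correlation rho:
## x | y ~ Normal(rho * y, sqrt(1 - rho^2)), and symmetrically for y | x
set.seed(1)
rho <- 0.6
n <- 1e4
x <- y <- numeric(n)
for (i in 2:n) {
  x[i] <- rnorm(1, rho * y[i - 1], sqrt(1 - rho^2))  # update x given y
  y[i] <- rnorm(1, rho * x[i],     sqrt(1 - rho^2))  # update y given x
}
cor(x, y)  # close to rho
```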

However, both Metropolis and Gibbs sampling are inefficient as models become more complex. For that reason, we’ll skip right ahead to Hamiltonian Monte Carlo sampling.

hamiltonian

  • kingdom in a valley

  • king drives in a random direction and at a random momentum.

    • as the vehicle goes uphill, it slows down and eventually turns around.
    • as the vehicle goes downhill, it picks up speed.
  • at fixed periods of time, the vehicle stops.

  • this is really a physics simulation – think of a skateboarder in a bowl

    • When the log-posterior is very flat, because there isn’t much information in the likelihood and the priors are rather flat, then the particle can glide for a long time before the slope (gradient) makes it turn around.
    • When instead the log-posterior is very steep, because either the likelihood or the priors are very concentrated, then the particle doesn’t get far before turning around.
  • HMC rejects some proposals.

    • What is the rejection criterion? Because HMC runs a physics simulation, certain things have to be conserved, like total energy of the system. When the total energy changes during the simulation, that means the numerical approximation is bad. When the approximation isn’t good, it might reject the proposal.

settings

  • LEAPFROG STEPS – Each path in the simulation is divided up into a number of leapfrog steps. If you choose many steps, the paths will be long. If you choose few, they will be short.

  • STEP SIZE – The step size determines how fine grained the simulation is. If the step size is small, then the particle can turn sharply. If the step size is large, then each leap will be large and could even overshoot the point where the simulation would want to turn around.

The warmup period of Stan is figuring out what the leapfrog steps and step size should be. This uses an algorithm called a No-U-Turn Sampler (NUTS).
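To make the leapfrog idea concrete, here is a minimal sketch (illustrative, not Stan’s actual implementation) of a leapfrog integrator on the bowl U(q) = q²/2, i.e. a standard-normal log-posterior. With a small step size, total energy is nearly conserved; with a step size that is too large, the particle overshoots the turn and energy blows up.

```r
## leapfrog integrator sketch: L steps of size eps on U(q) = q^2 / 2
leapfrog <- function(q, p, eps, L) {
  for (i in 1:L) {
    p <- p - eps / 2 * q  # half step on momentum (gradient of U is q)
    q <- q + eps * p      # full step on position
    p <- p - eps / 2 * q  # half step on momentum
  }
  c(q = q, p = p)
}
H <- function(q, p) q^2 / 2 + p^2 / 2  # total energy: potential + kinetic

good <- leapfrog(q = 1, p = 0.5, eps = 0.1, L = 20)
abs(H(1, 0.5) - H(good["q"], good["p"]))  # tiny: energy nearly conserved

bad <- leapfrog(q = 1, p = 0.5, eps = 2.2, L = 20)
H(bad["q"], bad["p"])  # enormous: the step size overshot the bowl
```

The second run is exactly the failure mode behind divergent transitions: the numerical approximation breaks down and total energy is no longer conserved.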

model specification

Let’s return to the height and weight data.

data(Howell1, package = "rethinking")
d <- Howell1
library(measurements)
d$height <- conv_unit(d$height, from = "cm", to = "feet")
d$weight <- conv_unit(d$weight, from = "kg", to = "lbs")
library(psych)
describe(d, fast = TRUE)
       vars   n  mean    sd median  min    max range  skew kurtosis   se
height    1 544  4.54  0.91   4.88 1.77   5.88   4.1 -1.26     0.58 0.04
weight    2 544 78.51 32.45  88.31 9.37 138.87 129.5 -0.54    -0.94 1.39
age       3 544 29.34 20.75  27.00 0.00  88.00  88.0  0.49    -0.56 0.89
male      4 544  0.47  0.50   0.00 0.00   1.00   1.0  0.11    -1.99 0.02
d <- d[d$age >= 18, ]
d$height_c <- d$height - mean(d$height)

\[\begin{align*} w_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta (h_i - \bar{h}) \\ \alpha &\sim \text{Normal}(130, 20) \\ \beta &\sim \text{Normal}(0, 25) \\ \sigma &\sim \text{Uniform}(0, 50) \\ \end{align*}\]

m1 <- brm(
  data = d,
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c(prior(normal(130, 20), class = Intercept),
            prior(normal(0, 25), class = b),
            prior(uniform(0, 50), class = sigma, ub = 50)),
  iter = 5000, warmup = 1000, chains = 4,
  seed = 3,
  file = here("files/data/generated_data/m1"))

brm() is the core function for fitting Bayesian models with brms. It translates the model above into a Stan model, which in turn defines the sampler and does the hard work, running in the background on your computer. There are many programs that can be used to run Stan, and you can even write Stan code directly if you want.

brms::stancode(m1)
// generated with brms 2.22.0
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int<lower=1> Kc;  // number of population-level effects after centering
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // regression coefficients
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0,upper=50> sigma;  // dispersion parameter
}
transformed parameters {
  real lprior = 0;  // prior contributions to the log posterior
  lprior += normal_lpdf(b | 0, 25);
  lprior += normal_lpdf(Intercept | 130, 20);
  lprior += uniform_lpdf(sigma | 0, 50)
    - 1 * log_diff_exp(uniform_lcdf(50 | 0, 50), uniform_lcdf(0 | 0, 50));
}
model {
  // likelihood including constants
  if (!prior_only) {
    target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}
m1 <- brm(
  data = d,
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c(prior(normal(130, 20), class = Intercept),
            prior(normal(0, 25), class = b),
            prior(uniform(0, 50), class = sigma, ub = 50)),
  iter = 5000, warmup = 1000, chains = 4,
  seed = 3,
  file = here("files/data/generated_data/m1"))

family specifies the distribution of the outcome. In many examples, we’ll use a gaussian (normal) distribution, but there are many, many options for this.

m1 <- brm(
  data = d,
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c(prior(normal(130, 20), class = Intercept),
            prior(normal(0, 25), class = b),
            prior(uniform(0, 50), class = sigma, ub = 50)),
  iter = 5000, warmup = 1000, chains = 4,
  seed = 3,
  file = here("files/data/generated_data/m1"))

The formula argument is what you would expect from the lm() and lmer() functions you have seen in the past. The benefit of brms is that this formula can easily handle complex and non-linear terms, which we’ll play with in future classes.

m1 <- brm(
  data = d,
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c(prior(normal(130, 20), class = Intercept),
            prior(normal(0, 25), class = b),
            prior(uniform(0, 50), class = sigma, ub = 50)),
  iter = 5000, warmup = 1000, chains = 4,
  seed = 3,
  file = here("files/data/generated_data/m1"))

Here we set our priors. Class b refers to slope parameters, i.e. beta coefficients. Again, this argument can become very detailed, specific, and flexible, and we’ll play more with it later.

m1 <- brm(
  data = d,
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c(prior(normal(130, 20), class = Intercept),
            prior(normal(0, 25), class = b),
            prior(uniform(0, 50), class = sigma, ub = 50)),
  iter = 5000, warmup = 1000, chains = 4,
  seed = 3,
  file = here("files/data/generated_data/m1"))

Hamiltonian MCMC runs for a set number of iterations, throws away the first portion (the warmup), and repeats that whole process multiple times (the number of chains).

warmup defaults to iter/2, but you can set it to something else. Remember, the warmup is not simply a burn-in period. It’s used to figure out the appropriate leapfrog and step size settings. So it’s worth allowing this to be large-ish. Also note that the warmup period will generally run more slowly than the sampling period.

In general, it’s recommended that you use one short chain to debug, then four longer chains for estimation, verification, and inference.

m1 <- brm(
  data = d,
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c(prior(normal(130, 20), class = Intercept),
            prior(normal(0, 25), class = b),
            prior(uniform(0, 50), class = sigma, ub = 50)),
  iter = 5000, warmup = 1000, chains = 4,
  seed = 3,
  file = here("files/data/generated_data/m1"))

Remember, these are random walks through parameter space, so set a seed for reproducibility. Also, these models can take a while to run, especially when you are developing more complex models. If you specify a file, the output of the model will automatically be saved. Even better, the next time you run this code, R will check for that file and load it into your workspace instead of re-running the model. (Just be sure to delete the model file if you make changes to any other part of the code.)

one more option: cores

m1 <- brm(
  data = d,
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c(prior(normal(130, 20), class = Intercept),
            prior(normal(0, 25), class = b),
            prior(uniform(0, 50), class = sigma, ub = 50)),
  iter = 5000, warmup = 1000, chains = 4,
  cores = 4,
  seed = 3,
  file = here("files/data/generated_data/m1"))

You can parallelize your chains (run them in parallel instead of in sequence) by specifying how many cores to use. Your computer (probably) has at least 4 cores, so 4 is a good choice here. But be warned that your computer may slow down elsewhere while the code is running, and may even run a bit hot.

summary(m1)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: weight ~ 1 + height_c 
   Data: d (Number of observations: 352) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    99.21      0.50    98.23   100.18 1.00    14807    12164
height_c     42.05      1.95    38.22    45.91 1.00    15921    12552

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     9.38      0.36     8.72    10.12 1.00    15626    12366

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

diagnostics: mixed chains

To evaluate your model, the first question to ask is “did your chains mix?”

library(bayesplot)
mcmc_trace(m1)
mcmc_rank_overlay(m1, pars=vars(b_Intercept:sigma)) +ylim(150, NA)

diagnostics: effective sample size

This is not \(N\), the number of participants in your sample; it refers to the draws from the posterior. The effective sample size is an estimate of how many independent samples from the posterior your draws are worth. Markov chains are typically autocorrelated, so sequential samples are not entirely independent. This happens when chains explore the posterior slowly, as in a Metropolis algorithm, and autocorrelation reduces the effective number of samples. Think of the effective sample size as the length of a Markov chain with no autocorrelation that would deliver an estimate of the same quality.
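The underlying idea can be sketched as \(n_\text{eff} \approx N / (1 + 2\sum_k \rho_k)\), where \(\rho_k\) is the lag-\(k\) autocorrelation. (This is a simplification: brms’s bulk and tail estimates use a more refined rank-normalized, split-chain version.) A strongly autocorrelated chain illustrates the shrinkage:

```r
## ESS sketch: an AR(1) chain with autocorrelation phi = 0.9
## is worth far fewer independent draws than its length
set.seed(1)
n <- 1e4
x <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))
rho <- acf(x, lag.max = 100, plot = FALSE)$acf[-1]  # drop lag 0
n_eff <- n / (1 + 2 * sum(rho[rho > 0]))
round(n_eff)  # far smaller than n = 10000
```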

brms gives you two estimates for effective sample size: bulk and tail.

How many do you need? It depends on how skewed your posterior is.

summary(m1)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: weight ~ 1 + height_c 
   Data: d (Number of observations: 352) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    99.21      0.50    98.23   100.18 1.00    14807    12164
height_c     42.05      1.95    38.22    45.91 1.00    15921    12552

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     9.38      0.36     8.72    10.12 1.00    15626    12366

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

diagnostics: \(\hat{R}\)

The Gelman-Rubin convergence metric \((\hat{R})\) is a comparison of between-chain variance to within-chain variance. You want this number to be as close to 1 as you can.

Note that a value of 1 doesn’t mean that you’re fine. It’s a sign of danger, but not a sign of safety.
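A bare-bones sketch of the (non-split) Gelman-Rubin computation, assuming draws arranged as one column per chain:

```r
## illustrative Rhat sketch (brms actually uses rank-normalized split chains)
rhat <- function(draws) {            # draws: n iterations x m chains
  n <- nrow(draws)
  W <- mean(apply(draws, 2, var))    # within-chain variance
  B <- n * var(colMeans(draws))      # between-chain variance
  var_hat <- (n - 1) / n * W + B / n # pooled variance estimate
  sqrt(var_hat / W)
}
set.seed(1)
mixed <- matrix(rnorm(4000), ncol = 4)            # 4 chains, same target
stuck <- mixed + rep(c(0, 0, 0, 3), each = 1000)  # one chain off target
c(rhat(mixed), rhat(stuck))  # ~1 when mixed, well above 1 when not
```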

summary(m1)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: weight ~ 1 + height_c 
   Data: d (Number of observations: 352) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    99.21      0.50    98.23   100.18 1.00    14807    12164
height_c     42.05      1.95    38.22    45.91 1.00    15921    12552

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     9.38      0.36     8.72    10.12 1.00    15626    12366

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

divergent transitions

A rule in physics is that energy must be conserved. Remember that HMC is literally a physics simulation. Energy won’t be conserved in the system if your king in a car/skateboarder zings out of the valley/bowl and into outer space. That’s a divergent transition.

Two primary ways to handle divergent transitions are by increasing the adapt_delta parameter, which we’ve already done a few times in previous chapters, or reparameterizing the model.

adapt_delta specifies the target average acceptance probability during the adaptation phase of sampling (warmup). The higher the delta, the smaller the step size. This makes the model more conservative, slower, and also much less likely to encounter problems. Its default is .8. Try increasing it to .9 or .99 if you have divergent transitions.
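In brms, adapt_delta is passed through the control argument. A sketch reusing the m1 model from above (not evaluated here, since it requires compiling and sampling the Stan model):

```r
## same model as m1, but with a more conservative adapt_delta
## (default is 0.8); useful if you see divergent transitions
m1_safe <- brm(
  data = d,
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c(prior(normal(130, 20), class = Intercept),
            prior(normal(0, 25), class = b),
            prior(uniform(0, 50), class = sigma, ub = 50)),
  iter = 5000, warmup = 1000, chains = 4,
  control = list(adapt_delta = 0.99),
  seed = 3)
```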